Moving the Bytes with Bao

Rüdiger Klaehn

n0.computer

2023-04-15

“Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”

Antoine de Saint-Exupéry

Outline

  • Iroh primitives
  • What do we want?
  • How does bao enable this?
  • Applications
  • Is this IPFS?

Iroh primitives

  • Iroh v2 was started in January 2023
  • Already used in production
  • Goal is to be extremely simple

Blobs

  • arbitrary size, from bytes to terabytes
  • filestore pattern by default
  • data itself remains under user control

Collections

  • collections are blobs
  • sequence of links (with some metadata)
  • arbitrary size, billions of links possible
  • links typically to blobs
  • same hash algo (blake3) for all links
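As a sketch (the real iroh collection encoding is not specified here; sha256 stands in for blake3 only to get 32-byte digests, and `link`/`make_collection` are hypothetical names): a collection is itself a blob whose payload is a sequence of links.

```python
import hashlib

def link(blob: bytes) -> bytes:
    """Hash a blob to a 32-byte link (sha256 as a stand-in for blake3)."""
    return hashlib.sha256(blob).digest()

def make_collection(blobs: list) -> bytes:
    """A collection is itself a blob: here, just the concatenated links."""
    return b"".join(link(b) for b in blobs)

blobs = [b"hello", b"world"]
coll = make_collection(blobs)
assert len(coll) == 32 * len(blobs)
coll_hash = link(coll)  # the collection, being a blob, has a hash of its own
```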

Verified streaming

  • only send content for requested hash
  • incremental verification
  • validate on send
  • validate on receive

Data transfer protocol

  • request/response
  • not a discovery protocol
  • verified streaming
    • verify on send and recv

Request scenarios

Request

  • root hash
  • ranges for each element
    • collection first, at offset 0
    • children in order of links

Response

  • concatenated response for each non-empty item
  • just the bytes?
    • not verified streaming
  • enter bao

What is blake3

  • Cryptographic hash function
  • Internal ephemeral merkle tree
  • Binary tree
  • 1024 byte chunks
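A rough model of that tree shape, with sha256 standing in for the real blake3 compression function (which additionally mixes in chunk counters and domain flags): leaves are 1024-byte chunks, and the left subtree always covers the largest power-of-two number of chunks strictly less than the total.

```python
import hashlib

CHUNK = 1024  # blake3 chunk size in bytes

def largest_pow2_below(n: int) -> int:
    """Largest power of two strictly less than n (blake3 left-subtree rule)."""
    p = 1
    while p * 2 < n:
        p *= 2
    return p

def tree_hash(data: bytes) -> bytes:
    """Shape of blake3's internal merkle tree; NOT real blake3 output."""
    n_chunks = max(1, -(-len(data) // CHUNK))
    if n_chunks == 1:
        return hashlib.sha256(data).digest()         # leaf: a single chunk
    split = largest_pow2_below(n_chunks) * CHUNK     # left = largest pow2 chunks
    left, right = tree_hash(data[:split]), tree_hash(data[split:])
    return hashlib.sha256(left + right).digest()     # parent = hash of the pair
```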

What is Bao

  • bao is blake3
  • tree is persisted
  • stored either
    • in separate file (outboard)
    • mixed with the data
  • we use outboard

Verified streaming?

Provider

  • get request
  • traverse relevant part of tree
  • in pre-order
  • send hash pairs and chunks

Requester

  • tree structure depends only on size
  • read size
  • use tree iterator to figure out what to expect
  • validate every time a new piece is received
  • detect corruption after 1 item (pair or chunk)
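A sketch of such a tree iterator: from the chunk count alone, the receiver can enumerate the exact pre-order sequence of 64-byte hash pairs and chunks to expect (illustrative item encoding, not the wire format).

```python
def largest_pow2_below(n: int) -> int:
    """Largest power of two strictly less than n (blake3 left-subtree rule)."""
    p = 1
    while p * 2 < n:
        p *= 2
    return p

def expected_items(n_chunks: int, start: int = 0):
    """Pre-order sequence of expected items for a full blob:
    ('pair', lo, hi) = 64 bytes of child hashes for chunks [lo, hi),
    ('chunk', i)     = up to 1024 bytes of data for chunk i.
    The shape depends only on the size, so the receiver computes it up
    front and validates each incoming item against known parent hashes."""
    if n_chunks == 1:
        yield ("chunk", start)
        return
    split = largest_pow2_below(n_chunks)
    yield ("pair", start, start + n_chunks)
    yield from expected_items(split, start)
    yield from expected_items(n_chunks - split, start + split)

# a 3-chunk blob: pair(0..3), pair(0..2), chunk 0, chunk 1, chunk 2
items = list(expected_items(3))
```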

Summary

  • Provider
    • traverse tree
    • read from data and outboard
    • validate and write to socket
  • Receiver
    • traverse tree to know what to expect
    • read from socket and validate
    • use data (stream or store)

Performance

  • blake3 is fast
  • just 2 files
  • sequential access pattern
  • seek forward in data, once for each gap
  • seek forward in outboard, worst case tree height
  • close to physical optimum
  • limit is quic + encryption
  • validate on send
  • Measuring on the fast track

Overhead

  • best case, large part or entire file
    • 1/256 of data size
    • 4 GiB for 1 TiB file
  • worst case, 1 chunk in a 1 TiB file
    • 64 * log2(chunks)
    • 1920 bytes
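The arithmetic behind these numbers, using the ratios as stated on this slide:

```python
TiB = 1024 ** 4
GiB = 1024 ** 3
chunks = TiB // 1024                   # 2**30 chunks in a 1 TiB file

# best case (whole file): 1/256 of the data size, as stated above
best = TiB // 256
assert best == 4 * GiB

# worst case (a single chunk): one 64-byte hash pair per tree level
levels = chunks.bit_length() - 1       # log2(2**30) = 30
worst = 64 * levels
assert worst == 1920                   # 64 * 30 bytes
```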

Chunk groups

  • outboards can get quite large
  • 1/16 of data, 64 GiB for 1 TiB file
  • idea: store only the higher levels of the tree
    • outboard fits in mem even for giant files
    • also reduces overhead when sending
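A sketch of the outboard-size math: 64 bytes per parent node, and grouping 2^k chunks drops the k lowest levels of the tree (`outboard_size` is an illustrative formula, not the iroh API).

```python
TiB = 1024 ** 4

def outboard_size(data_size: int, chunk_group: int = 1) -> int:
    """Approximate outboard size: 64 bytes per parent node, where
    a tree over g chunk groups has g - 1 parents."""
    chunks = -(-data_size // 1024)
    groups = -(-chunks // chunk_group)
    return 64 * max(0, groups - 1)

full = outboard_size(TiB)           # ~1/16 of the data: just under 64 GiB
grouped = outboard_size(TiB, 16)    # 16-chunk groups: just under 4 GiB
```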

Applications

Future plans

  • append only
    • blobs
    • collections
  • arbitrary writes
    • sync live db images
    • single writer mutable file system

Limitations

  • deep and highly dynamic DAGs
    • rabin chunking
    • prolly trees
  • you can put collections in collections, but
    • currently no plan for dag-aware gc

Linux kernel

stats for linux.car

  • branch size: 4740977
  • leaf size: 1293774012

Wikipedia math

stats for wiki_math.car

  • branch size: 33407570
  • leaf size: 3541099753

Lotus testnet

stats for lotus_testnet.car

  • branch size: 414198
  • leaf size: 23932

Is this still IPFS?

  • No
  • Yes

Content-addressing

  • every byte is content-addressed
  • incremental verification everywhere
  • we don’t even trust the file system
  • hence validate on send

Questions

Extra

Components

  • libp2p -> quic
  • unixFS: no*
    • we have file system support
  • IPLD: no*
  • Bitswap: no
  • IPNS: future plan
  • HTTP Gateway: user
    • range request support
  • DHT: future plan
  • (*) DAG gc support: monetized use case

Post-order outboard

  • pre-order outboard
    • moves nodes around when adding
    • not good for append only
  • post-order outboard
    • nodes stay fixed when appending data
    • good for appending
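A small demonstration of why: listing parent nodes as chunk spans, the post-order sequence for 4 chunks is a prefix of the one for 8 chunks, while the pre-order sequence is not. In general, appending only rewrites the incomplete right spine; completed subtrees stay at fixed offsets.

```python
def largest_pow2_below(n: int) -> int:
    """Largest power of two strictly less than n (blake3 left-subtree rule)."""
    p = 1
    while p * 2 < n:
        p *= 2
    return p

def parents(n: int, start: int = 0, order: str = "post"):
    """Parent nodes of the tree over chunks [start, start+n),
    as (lo, hi) spans, in pre- or post-order."""
    if n < 2:
        return []
    split = largest_pow2_below(n)
    left = parents(split, start, order)
    right = parents(n - split, start + split, order)
    node = [(start, start + n)]
    return node + left + right if order == "pre" else left + right + node

post4, post8 = parents(4), parents(8)
assert post8[:len(post4)] == post4       # completed nodes stay in place
pre4, pre8 = parents(4, order="pre"), parents(8, order="pre")
assert pre8[:len(pre4)] != pre4          # pre-order moves nodes around
```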

What if blake3 gets broken

  • collections contain header
  • use other hash fn
  • construction of bao is independent of hash fn
  • could do it with SHA2-256
  • not our biggest concern atm

How would you support dags

  • plugin fn that takes collection and returns bitmap
  • bitmap says which links are to collections
  • then do normal dag liveness like in ipfs-embed

Header or no header

  • Header allows changing hash algo mid tree
  • No header is less stuff.

What about other hashes

  • No go: we rely on blake3 stream validation
  • In trusted scenarios:
    • mapping to blake3 hash
    • then use that to download
    • verify at the end
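A sketch of that trusted-mapping idea (`blake3_stub`, `publish`, and `fetch_by_sha256` are hypothetical names; sha256 plays the foreign hash, and the dicts stand in for the mapping service and the network):

```python
import hashlib

mapping = {}   # foreign (sha256) digest -> blake3 hash, from a trusted party
store = {}     # blake3 hash -> content, stand-in for the network

def blake3_stub(data: bytes) -> bytes:
    """Stand-in: we only need *some* 32-byte content address here."""
    return hashlib.sha256(b"blake3:" + data).digest()

def publish(data: bytes) -> None:
    mapping[hashlib.sha256(data).digest()] = blake3_stub(data)
    store[blake3_stub(data)] = data

def fetch_by_sha256(want: bytes) -> bytes:
    data = store[mapping[want]]       # verified streaming would happen here
    # verify the foreign hash once, at the end of the download
    assert hashlib.sha256(data).digest() == want
    return data
```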

Discovery protocol

  • basically like requests
  • you get back info about what is there
  • range spec seq